The goal of this exercise is to investigate if and how systematic sub-sampling in situations where \(n>>p\) can be applied to imbalanced learning. The note is structured as follows: to set the stage for the final investigation the first section briefly introduces the bias-variance trade-off. The following section then introduces various subsampling methods. Section 3 illustrates the improvements associated with non-uniform subsampling. The final two sections then apply the ideas to binary classification problems with imbalanced training data.
source("R/gauss_kernel.R")
source("R/sinusoidal.R")
source("R/sim_sinusoidal.R")
source("R/regularized_ls.R")
# Generate data: ----
n <- 25
p <- 24
a <- 0
b <- 1
x <- seq(a,b,length.out = n)
y <- sinusoidal(x)
v <- 0.3
As our data generating process for \(y\) we will consider the sinusoidal function \(f(x)=\sin(2\pi x)\) as in Bishop (2006). To simulate random samples of \(y\) we sample \(n\) input values from \(x \sim \text{unif}(0,1)\) and introduce some white noise \(\varepsilon \sim \mathcal{N}(0,0.3)\). Figure @ref(fig:p_sim) shows \(y\) along with random draws \(y^*_n\).
# True data: ----
library(ggplot2)
library(data.table)
dt_true <- data.table(y,x)
pl <- ggplot(data=dt_true, aes(x=x, y=y)) +
geom_line()
# Simulate data: ----
n_draws <- 3
dt_star <- rbindlist(
lapply(
1:n_draws,
function(x) {
simulated <- sim_sinusoidal(n=n, sigma = v)
data.table(y=simulated$y_star, x=simulated$x_star, n=1:n, draw=x)
}
)
)
pl +
geom_point(
data = dt_star,
aes(x=x, y=y, colour=factor(draw))
) +
scale_colour_discrete(
name="Draw:"
)
(#fig:p_sim)Sinusoidal function and random draws.
lambda <- c(
exp(2.6),
exp(-0.31),
exp(-2.4)
)
s <- 0.1
n_draws <- 100
mu <- seq(a,b,length.out = p)
As in Bishop (2006) we will use a Gaussian linear model with Gaussian kernels \(\exp(- \frac{(x_k-\mu_p)^{2}}{2s^2})\) to estimate \(\hat{y}_k\) from our data \((y_k,x_k)\)
\[ \begin{equation} \begin{aligned} && y|x& =f(x) \sim \mathcal{N} \left( \sum_{j=0}^{p-1} \phi_j(x)\beta_j, v \mathbb{I}_p \right) \\ \end{aligned} \tag{1.1} \end{equation} \]
with \(v=0.3\). We fix the number of kernels \(p=24\) (and hence the number of features \(M=p+1=25\)) as well as the spatial scale \(s=0.1\). To vary the complexity of the model we use a form of regularized least-squares (Ridge regression) and let the regularization parameter \(lambda\) vary
\[ \begin{equation} \begin{aligned} && \hat\beta&=(\lambda I + \Phi^T \Phi)^{-1}\Phi^Ty \\ \end{aligned} \tag{1.2} \end{equation} \]
where high values of \(\lambda\) in (1.2) shrink parameter values towards zero. (Note that a choice \(\lambda=0\) corresponds to the OLS estimator.) As in Bishop (2006) we proceed as follows for each choice of \(\lambda\) and each sample draw:
Phi <- cbind(
rep(1,n),
sapply(
1:length(mu),
function(p) {
mu_p <- mu[p]
gauss_kernel(x=x, mu=mu_p, s = s)
}
)
)
dt <- rbindlist(
lapply( # loop - draw K times
1:n_draws,
function(k) {
# Draw:
simulated <- sim_sinusoidal(n=n, sigma = v)
y_k <- simulated$y_star
x_k <- simulated$x_star
rbindlist(
lapply( # loop over regularization parameter
1:length(lambda),
function(t) {
# Regularization parameter:
lambda_t <- lambda[t]
# Extract features:
Phi_k <- cbind(
rep(1,n),
sapply(
1:length(mu),
function(p) {
mu_p <- mu[p]
gauss_kernel(x=x_k, mu=mu_p, s = s)
}
)
)
beta_hat <- regularized_ls(Phi_k,y_k,lambda_t) # fit model on (y,x)
y_hat <- c(Phi %*% beta_hat) # predict from model
dt <- data.table(value=y_hat,draw=k,lambda=lambda_t,n=1:n,x=x)
return(dt)
}
)
)
}
)
)
dt[,facet_group:="single"]
dt[,colour_group:="estimates"]
# Expected values:
dt_exp = dt[,.(value=mean(value)),by=.(lambda,n,x)]
dt_exp[,facet_group:="aggregate"]
dt_exp[,colour_group:="estimates"]
dt_exp[,draw:=1] # just for aesthetics
# True values:
library(reshape)
dt_true = data.table(expand.grid.df(data.frame(value=y,x=x),data.frame(lambda=lambda)))
dt_true[,facet_group:="aggregate"]
dt_true[,colour_group:="true"]
dt_true[,draw:=2] # just for aesthetics
# Plot data:
dt_plot = rbind(
dt,
dt_exp,
dt_true,
fill=T
)
Recall that for the mean-squared error (MSE) we have
\[ \begin{equation} \begin{aligned} && \mathbb{E} \left( (\hat{f}_n(x)-f(x))^2 \right) &= \text{var} (\hat{f}_n(x)) + \left( \mathbb{E} \left( \hat{f}_n(x) \right) - f(x) \right)^2 \\ \end{aligned} \tag{1.3} \end{equation} \]
where the first term on the right-hand side corresponds to the variance of our prediction and the second term to its (squared) bias. Applying the above procedure we can construct the familiar picture that demonstrates how increased model complexity increases variance while reducing bias (Figure 1.1).
dt_plot[,log_lambda := log(lambda)]
ggplot(data=dt_plot[draw<=25], aes(y=value, x=x, colour=colour_group, group=draw)) +
geom_line() +
facet_grid(
rows = vars(log_lambda),
cols = vars(facet_group)
)
Figure 1.1: Bias-variance trade-off
source("R/UNIF.R")
source("R/BLEV.R")
source("R/OPT.R")
source("R/PL.R")
source("R/wls_qr.R")
source("R/rorthogonal.R")
source("R/sim_subsampling.R")
n <- 1000
m <- 10
p <- 10
Let us now consider the case for sub-sampling: \(n >> p\). We continue with the sinusoidal function from above but now look at the case where the number of observations is \(n=1000\) and the number of features is \(p=10\). Suppose we are interested in estimating \(\hat\beta_m\) instead of \(\hat\beta_m\) where \(p\le m=10<<n\) with \(m\) freely chosen by us. In practice we may want to do this to avoid high computational costs associated with large \(n\). Or perhaps we are only allowed to load to access \(m\) observations at a time. The basic algorithm for performing estimating \(\hat\beta_m\) is as follows:
But there are at least two questions about this algorithm: firstly, how do we choose \(X_m=x^{(1)},...,x^{(m)}\)? Secondly, how should we construct \(\hat\beta_m\)? With respect to the former, a better idea than just randomly selecting \(X_m\) might be to choose observations with high influence. We will look at a few of the different subsampling methods investigated and proposed in Zhu et al. (2015), which differ primarily in their choice of subsampling probabilities \(\{\pi_i\}^n_{i=1}\):
The PL method is proposed by the authors and shown to scale very well and is good approximation conditional on leverage scores \(h_{ii}\) being fairly homogenous.
With respect to the second question Zhu et al. (2015) investigate both ordinary least-squares (OLS) and weighted least-squares (WLS), where weights simply correspond to subsampling probabilities \(\{\pi_i\}^n_{i=1}\). The authors present empirical evidence that OLS is more efficient than WLS in that the mean-squared error (MSE) for predicting \(\Phi \beta\) is lower for OLS. Unfortunately though subsampling using OLS is not consistent for non-uniform subsampling methods meaning that the bias cannot be controlled. Given Equation (1.3) this implies that the variance of predictions is higher with WLS and that this effect dominates the effect of relatively higher bias with OLS. In fact this is consistent with the theoretical results presented in Zhu et al. (2015) (more on this below).
Next we will briefly run through different subsampling and estimation methods ins some more detail and see how they can be implemented in R. In the following section we will then look at how the different approaches perform empirically.
Both OLS and WLS are implemented using QR decompostion. As for OLS this is very easily done in R. Given some feature matrix Phi and a corresponding outcome variable y we can use qr.solve(Phi, y) compute \(\hat\beta\). For WLS we need to first weigh observations by their corresponding subsampling probabilities. Following Zhu et al. (2015) we can construct a weighting matrix \(\Omega= \text{diag}\{\pi_i\}^m_{i=1}\) and compute the weighted least-squares estimator as:
\[ \begin{equation} \begin{aligned} && \hat\beta_m^{WLS}&= \left( \Phi^T \Omega^{-1} \Phi \right)^{-1} \Phi^T\Omega^{-1}y\\ \end{aligned} \tag{2.1} \end{equation} \] The derivation for (2.1) is as follows (note that here I’m using the same notation as in Zhu et al. (2015) where \(\Phi\) corresponds to our \(\Omega\) and \(X\) to our \(\Phi\)):
Weighted least-squares
In R this can be wrapped up in a simple function:
wls_qr <- function(X, y, weights) {
Phi <- diag(weights)
X_weighted <- sqrt(qr.solve(Phi)) %*% X
y_weighted <- sqrt(qr.solve(Phi)) %*% y
beta <- qr.solve(X_weighted, y_weighted)
return(beta)
}
A simple function for uniform leveraging in R is shown in the code chunk below. Note that to streamline the comparison of the different methods in the following section the function takes an unused argument weighted=F which for the other subsampling methods can be used to determine wether OLS or WLS should be used. Of course, with uniform leveraging the weights are all identical so \(\hat\beta^{OLS}=\hat\beta^{WLS}\) so the argument is passed to but not evaluated in UNIF. (There are likely more elegant ways to do this!)
UNIF <- function(Phi, y, m, weighted=F, rand_state=NULL) {
if (!is.null(rand_state)) {
set.seed(rand_state)
}
indices <- sample(1:n, size=m)
Phi_m <- Phi[indices,]
y_m <- y[indices]
beta_hat <- qr.solve(Phi_m, y_m)
y_hat <- c(Phi %*% beta_hat)
return(
list(
fitted = y_hat,
coeff = beta_hat
)
)
}
The UNIF function can be extended easily to the case with basic leveraging (see code below). Note that in this case the weighted argument is evaluated.
BLEV <- function(Phi, y, m, weighted=F, rand_state=NULL, plot_wgts=F, prob_only=F) {
svd_Phi <- svd(Phi)
U <- svd_Phi$u
H <- tcrossprod(U)
h <- diag(H)
prob <- h/ncol(Phi)
# Plot:
if (plot_wgts) {
plot(prob, t="l", ylab="Sampling probability")
}
# Output:
if (prob_only) {
return(prob)
} else {
indices <- sample(
x = 1:n,
size = m,
replace = T,
prob = prob
)
Phi_m <- Phi[indices,]
y_m <- y[indices]
weights <- prob[indices]
if (weighted) {
beta_hat <- wls_qr(Phi_m, y_m, weights)
} else {
beta_hat <- qr.solve(Phi_m, y_m)
}
y_hat <- c(Phi %*% beta_hat)
return(
list(
fitted = y_hat,
coeff = beta_hat,
prob = prob
)
)
}
}
Recall that for the hat matrix we have
\[ \begin{equation} \begin{aligned} && \mathbb{H}&=\Phi (\Phi^T \Phi)^{-1}\Phi^T \\ \end{aligned} \tag{2.2} \end{equation} \]
where the diagonal elements \(h_{ii}\) correspond to the leverage scores we’re after. Following Zhu et al. (2015) we will use (compact) singular value decomposition to obtain \(\mathbb{H}\) rather than computing (2.2) directly. This has the benefit that there exist exceptionally stable numerical algorithms to compute SVD. (source) To see why we can use SVD to obtain \(\mathbb{H}\) consider the following:
From SVD to leverage
Clearly to get \(h_{ii}\) we first need to compute \(\mathbb{H}\) which in terms of computational costs is of order \(\mathcal{O}(np^2)=\max(\mathcal{O}(np^2),\mathcal{O}(p^3))\). The fact that we use all \(n\) rows of \(\Phi\) to compute leverage scores even though we explicitly stated our goal to only use \(m\) observations is a bit of a paradox. This is why fast algorithms that approximate leverage scores have been proposed. We will not look at them specifically here mainly because the method proposed by Zhu et al. (2015) promises to be computationally even more efficient.
The basic characteristic of PL subsampling - choosing \(\{\pi_i\}^n_{i=1}= ||x_i||/ \sum_{j=1}^{n}||x_j||\) - was already introduced above. Again it is very easy to modify the subsampling functions from above to this case:
PL <- function(Phi, y, m, weighted=F, rand_state=NULL, plot_wgts=F, prob_only=F) {
predictor_len <- sapply(
1:nrow(Phi),
function(i) {
norm(as.matrix(Phi[i,]), type="f")
}
)
prob <- sapply(predictor_len, function(i) i/sum(predictor_len))
# Plot:
if (plot_wgts) {
plot(prob, t="l", ylab="Sampling probability")
}
# Output:
if (prob_only) {
return(prob)
} else {
indices <- sample(
x = 1:n,
size = m,
replace = T,
prob = prob
)
Phi_m <- Phi[indices,]
y_m <- y[indices]
weights <- prob[indices]
if (weighted) {
beta_hat <- wls_qr(Phi_m, y_m, weights)
} else {
beta_hat <- qr.solve(Phi_m, y_m)
}
y_hat <- c(Phi %*% beta_hat)
return(
list(
fitted = y_hat,
coeff = beta_hat,
prob = prob
)
)
}
}
In fact, this PL subsampling is an approximate version of optimal subsampling (OPT). Zhu et al. (2015) show that variance \(V=\text{var} (\hat{f}_n(x))\) dominates squared bias \(\left( \mathbb{E} \left( \hat{f}_n(x) \right) - f(x) \right)^2\). More specifically they show that asymptotically
\[ \begin{equation} \begin{aligned} && V&\rightarrow m^{-1} \\ \end{aligned} \tag{2.3} \end{equation} \]
and
\[ \begin{equation} \begin{aligned} && \left( \mathbb{E} \left( \hat{f}_n(x) \right) - f(x) \right)^2&\rightarrow (m^{-1})^2 \\ \end{aligned} \tag{2.4} \end{equation} \]
where \(m\) is the size of the subsample as before. Given this result minimizing the MSE (Equation (1.3)) with respect to subsampling probabilities \(\{\pi_i\}^n_{i=1}\) is equivalent to minimizing \(V\). They further show that this minimization problem has the following closed-form solution:
\[ \begin{equation} \begin{aligned} && \pi_i&= \frac{\sqrt{(1-h_{ii})}||x_i||}{\sum_{j=1}^n\sqrt{(1-h_{jj})}||x_j||}\\ \end{aligned} \tag{2.5} \end{equation} \]
This still has computational costs of order \(\mathcal{O}(np^2)\). But now it becomes clear why PL subsampling is optimal conditional on leverage scores being homogenous:
From OPT to PL
PL subsampling is associated with computational costs of order \(\mathcal{O}(np)\) so a great improvement in particular when both \(n\) and \(p\) are large.
n <- 1000
p <- 100
As discussed in Zhu et al. (2015) both OPT and PL subsampling tend to inflate subsampling probabilities of observations with low leverage scores and shrink those of high-leverage observations relative to BLEV. They show explicitly that this always holds for orthogonal design matrices. As a quick sense-check of the function introduced above we can generate a random orthogonal design matrix \(X\) and plot subsampling probabilities with OPT and PL against those obtained with BLEV. Figure 2.1 illustrates this relationship nicely. I have generated \(X\) \((n \times p)\) with \(n=1000\) and \(p=100\) using SVD:
rorthogonal <- function(n,p) {
M <- matrix(rnorm(n*p), n, p)
X <- svd(M)$u
return(X)
}
X <- rorthogonal(n,p)
# Subsampling methods:
methods <- list(
"UNIF" = UNIF,
"BLEV" = BLEV,
"OPT" = OPT,
"PL" = PL
)
smpl_probs <- rbindlist(
lapply(
names(methods)[2:4],
function(i) {
method <- i
prob <- methods[[method]](X, y=NULL, m=NULL, prob_only=T)
return(data.table(prob=prob, method=method))
}
)
)
dt_plot <- rbindlist(
lapply(
names(methods)[3:4],
function(i) {
other_method <- i
x <- smpl_probs[method=="BLEV"]$prob
y <- smpl_probs[method==other_method]$prob
data.table(x=x, y=y, y_var=other_method)
}
)
)
ggplot(dt_plot, aes(x=x, y=y)) +
geom_abline(intercept = 0, slope=1, linetype="dashed") +
geom_point(shape=1) +
facet_grid(
cols=vars(y_var)
) +
labs(
x="BLEV",
y=""
)
Figure 2.1: Comparison of subsampling probabilities.
To illustrate the improvements associated with the methods proposed in Zhu et al. (2015), we will briefly replicate their main empirical findings here. The evaluate the performance of the different methods we will proceed as follows:
Numerical exercise
In R we will use the following function for this purpose:
sim_subsampling <- function(Phi, y, m, subsample_estimator, J=1000, bias_correct=F, B=1000, ...) {
# Compute full sample OLS estimator:
y_hat_ols <- qr.fitted(qr.default(Phi),y)
# Compute J predictions from subsample estimator:
run_J_predictions = function() {
output <- rbindlist(
lapply(
1:J,
function(j) {
estimate <- tryCatch(
subsample_estimator(Phi, y, m, ...), # subsample estimate
error = function(e) {
return(list(fitted=NA))
}
)
if (bias_correct) {
# Bootstrap bias-correction term:
w <- 0
} else {
w <- 0
}
data.table(
y_hat_ols = c(y_hat_ols),
y_hat_subsample = estimate$fitted,
i = 1:length(y_hat_ols),
j = j,
w = w
)
}
)
)
# Compute variance, bias and MSE:
output <- na.omit(output) # for cases where estimation yielded NaNs (can happen when m->p)
output[,avg_y_hat_subsample := mean(y_hat_subsample), by=.(i)]
V <- output[,.(V_b=(1/n)*norm(as.matrix(y_hat_subsample - avg_y_hat_subsample), type="f")^2),by=.(j)][,mean(V_b)]
bias_sq <- output[,.(avg_bias_i = mean(y_hat_ols - y_hat_subsample)), by=.(i)][,(1/n)*norm(as.matrix(avg_bias_i, type="f")^2)]
mse_check <- output[,.(mse_b=(1/n)*norm(as.matrix(y_hat_ols - y_hat_subsample), type="f")^2),by=.(j)][,mean(mse_b)]
mse <- V + bias_sq
if (abs(mse_check-mse)>1e-5) {
warning(
sprintf(
'Inconsistent values for MSE and MSE from decomposition: %0.5f',
mse_check-mse
)
)
}
output <- data.table(
value = c(V,bias_sq,mse),
variables = c("V", "Squared bias", "MSE")
)
}
# Run with tryCatch:
output <- tryCatch(
run_J_predictions(),
error=function(e) {
warning("Error occured. Returning NaNs.")
output <- data.table(
value = rep(NA, 3),
variables = c("V", "Squared bias", "MSE")
)
return(output)
}
)
return(
output
)
}
n <- 1000
p <- 3
As in Zhu et al. (2015) we will generate the design matrix \(X\) from 5 different distributions: 1) Gaussian (GA) with \(\mathcal{N}(\mathbf{0},\Sigma)\); 2) Mixed-Gaussian (MG) with \(0.5\mathcal{N}(\mathbf{0},\Sigma)+0.5\mathcal{N}(\mathbf{0},25\Sigma)\); 3) Log-Gaussian (LN) with \(\log\mathcal{N}(\mathbf{0},\Sigma)\); 4) T-distribution with 1 degree of freedom (T1) and \(\Sigma\); 5) T-distribution as in 4) but truncated at \([-p,p]\). All parameters are chosen in the same way as in Zhu et al. (2015) with exception of \(n=1000\) and \(p=3\), which are significantly smaller choices in order to decrease the computational costs. The corresponding densities of the 5 data sets are shown in Figure 5.1 in the appendix.
We will run the numerical exercise for each data set and each subsampling method introduced above. Figure ?? shows logarithms of the sampling probabilities corresponding to the different subsampling methods (UNIF not shown for obvious reasons). The plots look very similar to the one in Zhu et al. (2015) and is shown here primarily to reassure ourselves that we have implemented their ideas correctly. One interesting observation is worth pointing out however: note how the distributions for OPT and PL have lower standard deviations compared to BLEV. This should not be altogether surprising since we already saw above that for orthogonal design matrices the former methods inflate small leverage scores while shrinking high scores. But it is interesting to see that the same appears to hold for design matrices that are explicitly not orthogonal given our choice of \(\Sigma\).
set.seed(1)
library(expm)
matrix_grid <- expand.grid(i=1:p,j=1:p)
Sigma <- matrix(rep(0,p^2),p,p)
for (x in 1:nrow(matrix_grid)) {
i <- matrix_grid$i[x]
j <- matrix_grid$j[x]
Sigma[i,j] <- 2 * (0.8)^(abs(i-j))
}
# 1.) Design matrix (as in Zhu et al): ----
GA <- matrix(rnorm(n*p), nrow = n, ncol = p) %*% sqrtm(t(Sigma))
# Gaussian mixture:
gaus_mix <- list(
gaus_1 = matrix(rnorm(n*p), nrow = n, ncol = p) %*% sqrtm(t(Sigma)),
gaus_2 = matrix(rnorm(n*p), nrow = n, ncol = p) %*% sqrtm((25 * t(Sigma)))
)
MG <- matrix(rep(0,n*p),n,p)
for (i in 1:nrow(MG)) {
x <- sample(1:2,1)
MG[i,] <- gaus_mix[[x]][i,]
}
# Log-Gaussian:
LN <- exp(GA)
# T-distribution:
T1 <- matrix(rt(n*p,1), nrow = n, ncol = p) %*% sqrtm(t(Sigma))
# Truncated T:
TT <- T1
TT[TT>p] <- p
TT[TT<(-p)] <- -p
data_sets <- list(
GA = list(X = GA),
MG = list(X = MG),
LN = list(X = LN),
TT = list(X = TT),
T1 = list(X = T1)
)
# 2.) Outcome:
data_sets <- lapply(
data_sets,
function(i) {
X <- i[["X"]]
beta <- c(rep(1,ceiling(0.6*p)),rep(0.1,floor(0.4*p)))
eps <- rnorm(n=n,mean=0,sd=10)
y <- X %*% beta + eps
list(X=X, y=y)
}
)
saveRDS(data_sets, "outputs/synthetic_data.rds")
pgrid <- expand.grid(method = names(methods)[2:4], data=names(data_sets))
smpl_probs <- rbindlist(
lapply(
1:nrow(pgrid),
function(i) {
method <- as.character(pgrid$method[i])
data_set <- pgrid$data[i]
X <- data_sets[[data_set]]$X
y <- data_sets[[data_set]]$y
prob <- methods[[method]](X, y, m=NULL, prob_only=T)
return(data.table(prob=prob, method=method, data=data_set))
}
)
)
saveRDS(smpl_probs, "outputs/smpl_probs.rds")
smpl_probs <- readRDS("outputs/smpl_probs.rds")
pgrid <- expand.grid(method = names(methods)[3:4], data=names(data_sets))
dt_plot <- rbindlist(
lapply(
1:nrow(pgrid),
function(i) {
other_method <- as.character(pgrid$method[i])
data_set <- pgrid$data[i]
x <- smpl_probs[method=="BLEV" & data==data_set]$prob
y <- smpl_probs[method==other_method & data==data_set]$prob
data.table(x=x, y=y, y_var=other_method, data=data_set)
}
)
)
ggplot(dt_plot, aes(x=x, y=y)) +
geom_abline(intercept = 0, slope=1, linetype="dashed") +
geom_point(shape=1) +
facet_wrap(
y_var ~ data,
scales = "free",
ncol=length(data_sets)
) +
labs(
x="BLEV",
y=""
)
ggplot(
data = smpl_probs,
aes(x=data, y=log(prob))
) +
geom_boxplot() +
facet_grid(
col=vars(method)
)
Figure 3.1: Sampling probabilities for different subsampling methods.
# Simulation parameters:
m <- c()
counter <- 1
while(length(m)<7 & max(m/n)<.1) {
temp <- max(p,(2^(counter)*1e-3) * n)
m <- unique(c(m, temp))
counter <- counter + 1
}
weighted <- c(T,F)
pgrid <- data.table(
expand.grid(
m=m,
method=names(methods),
weighted=weighted,
data_set=names(data_sets)
)
)
set.seed(1)
grid_search <- rbindlist(
lapply(
1:nrow(pgrid),
function(i) {
m <- pgrid[i,m]
estimator_name <- pgrid[i,method]
estimator <- methods[[estimator_name]]
weighted <- pgrid[i, weighted]
data_set <- pgrid$data[i]
X <- data_sets[[data_set]]$X
y <- data_sets[[data_set]]$y
output <- sim_subsampling(X, y, m, estimator, weighted=weighted)
output[,m:=m]
output[,method:=estimator_name]
output[,weighted:=weighted]
output[,data_set:=data_set]
}
)
)
grid_search[,weighted:=factor(weighted)]
levels(grid_search$weighted) <- c("OLS","WLS")
grid_search[,log_value:=log(value)]
saveRDS(grid_search, file="outputs/grid_search_zhu.rds")
grid_search <- readRDS("outputs/grid_search_zhu.rds")
p_list <- lapply(
1:2,
function(i) {
ggplot(
data=grid_search[weighted==levels(weighted)[i]],
aes(x=m/n, y=log_value, colour=method)
) +
geom_line() +
geom_point() +
facet_grid(
rows = vars(variables),
cols = vars(data_set),
scales = "free_y"
) +
scale_color_discrete(
name="Subsampling method:"
) +
labs(
x="m/n",
y="Logarithm",
title=levels(grid_search$weighted)[i]
)
}
)
names(p_list) <- levels(grid_search$weighted)
Figures 3.2 and 3.3 show the resulting MSE, squared bias and variance for the different subsampling methods and data sets using weighed least-squares and ordinary least-squares, respectively. The subsampling size increases along the horizontal axis.
For the data sets that are also shown in Zhu et al. (2015) we find the same overall pattern: PL and OPT outperform other methods when using weighted least-squares, while BLEV outperforms other methods when using unweighted/ordinary least-squares. For Gaussian data (GA) the differences between the methods are minimal since data points are homogenous. In fact, Zhu et al. (2015) recommend to just rely on uniform subsampling when data is Gaussian. Another interesting observations is that for t-distributed data (T1) the non-uniform subsampling methods significantly outperform uniform subsampling methods. This is despite the fact that in the case of T1 data the conditions used to establish asymptotic consistency of the non-uniform subsampling methods in Zhu et al. (2015) are not fullfilled: in particular the fourth moment is not finite (in fact it is not defined). This is curious: even though we may not want to use subsampling at all if this condition is not met, we should at least employ non-uniform subsampling if we really do have to estimate \(\beta\) through subsampling.
p_list$WLS
Figure 3.2: MSE, squared bias and variance for different subsampling methods and data sets. Subsampling with weighted least-squares.
p_list$OLS
Figure 3.3: MSE, squared bias and variance for different subsampling methods and data sets. Subsampling with ordinary least-squares.
source("R/undersampling.R")
source("R/sim_undersampling.R")
source("R/logit_irls.R")
In binary classification problems we are often faced with the issue of imbalanced training data - one of the two classes is under-represented relative to the other. This generally makes classifiers less sensitive to the minority class which often is the class we want to predict (Branco, Torgo, and Ribeiro (2015)). Suppose for example we wanted to predict the probability of death for patients who suffer from COVID-19. The case-fatality rate for the virus is significantly lower than 10% so any data we could obtain on patients would inevitably be imbalanced: the domain is skewed towards the class we are not interested in predicting.
A common and straight-forward way to deal with this issue is to randomly over- or under-sample the training data. Let \(y_i\in{0,1}\) for all \(i=1,...,n\). We are interested in modelling \(p_i=P(y_i=1|x_i)\) but our data is imbalanced: \(n_{y=0}>>n_{y=1}\) where \(n=n_{y=0}+n_{y=1}\). Then random over- and under-sampling works as follows:
In a way both these methods correspond to uniform subsampling (UNIF) discussed above. Random oversampling may lead to overfitting. Conversely, random undersampling may come at the cost of elimination observations with valuable information. With respect to the latter, we have already shown that more systematic subsampling approaches generally outperform uniform subsampling in linear regression problems. It would be interesting to see if we can apply these ideas to classification with imbalanced data. How exactly we can go about doing this should be straight-forward to see:
To remain in the subsampling framework we will focus on undersampling here. It should be noted that many other more sophisticated approaches to undersampling already exist. In their extensive survey Branco, Torgo, and Ribeiro (2015) mention undersampling methods based on distance-criteria, condensed nearest neighbours as active learning methods. Optimal subsampling as in Zhu et al. (2015) is not mentioned.
n <- length(data_sets$GA$y)
prob <- 0.05
# y <- sample(c(1,0), n, T, prob=c(prob,1-prob))
In this section we will use the same synthetic data sets as above for our design matrix \(X\). But while above we sampled \(y_i |x_i\sim \mathcal{N}(\beta^T x_i,\sigma)\), here we will simulate a single draw from \(y |X \sim \text{Bernoulli}(p)\), where in order to create imbalance we set \(p=0.05\) (an overestimate of the true case-fatality rate of COVID if you want). To model the probability of \(y=1\) we will use logistic regression:
\[ \begin{equation} \begin{aligned} && p&= \frac{ \exp( \mathbf{X} \beta )}{1 + \exp(\mathbf{X} \beta)} \end{aligned} \tag{4.1} \end{equation} \]
Equation (4.1) is not estimated directly but rather derived from linear predictions
\[ \begin{equation} \begin{aligned} \log \left( \frac{p}{1-p} \right)&= \mathbf{X} \beta \\ \end{aligned} \tag{4.2} \end{equation} \] where \(\beta\) can be estimated through iterative re-weighted least-squares (IRLS) which is simple implementation of Newton’s method (see for example Wasserman (2013); a complete derivation can also be found in the appendix):
\[
\begin{equation}
\begin{aligned}
&& \beta_{s+1}&= \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right) \mathbf{X}^T \mathbf{W}z\\
\text{where}&& z&= \mathbf{X}\beta_{s} + \mathbf{W}^{-1} (y-p) \\
&& \mathbf{W}&= \text{diag}\{p_i(\beta_{s})(1-p_i(\beta_{s}))\}_{i=1}^n \\
\end{aligned}
\tag{4.3}
\end{equation}
\]
In R this can be implemented as below. The function also implements weighted logistic regression which we will use in the same way we used weighted least-squares above. Alternatively one could of course just use glm([formula], family="binomial") to do this in R.
logit_irls <- function(X, y, beta_0=NULL, tau=1e-9, max_iter=10000, weights=NULL) {
if(!all(X[,1]==1)) {
X <- cbind(1,X)
}
p <- ncol(X)
n <- nrow(X)
if (!is.null(weights)) {
Phi <- diag(weights)
X <- sqrt(qr.solve(Phi)) %*% X
y <- sqrt(qr.solve(Phi)) %*% y
}
# Initialization: ----
if (is.null(beta_0)) {
beta_latest <- matrix(rep(1, p)) # naive first guess
}
W <- diag(n)
can_still_improve <- T
iter <- 1
# Iterative reweighted least-squares (IRLS):
while(can_still_improve & iter < max_iter) {
y_hat <- X %*% beta_latest
p_y <- exp(y_hat)/(1+exp(y_hat))
df_latest <- crossprod(X,y-p_y) # gradient
diag(W) <- p_y*(1-p_y)
Z <- X %*% beta_latest + qr.solve(W) %*% (y-p_y)
beta_latest <- qr.solve(crossprod(X,W%*%X),crossprod(X,W%*%Z))
can_still_improve <- mean(abs(df_latest))>tau # convergence reached?
iter <- iter + 1
}
return(
list(
fitted = p_y,
coeff = beta_latest
)
)
}
When thinking about repeating the numerical exercise from Section 3 above, we are faced with the question of how to compute the MSE and its bias-variance decomposition. One idea could be to compare the linear predictions in (4.2) in the same way as we did before, since after all these are fitted values from a linear model. The problem with that idea is that rebalancing the data through undersampling affects the intercept term \(\beta_0\) (potentially quite a bit if the data was originally very imbalanced). This is not surprising since \(\beta_0\) measures the log odds of \(y_i\) given all predictors \(\mathbf{X}_i=0\) are zero. Hence comparing linear predictors \(\mathbf{X}\beta_n\) and \(\mathbf{X}\beta_m\) is not a good idea in this setting. Fortunately though we can just work with the predicted probabilities in (4.1) directly and decompose the MSE in the same way as before (see Manning, Schütze, and Raghavan (2008) pp. 310):
\[ \begin{equation} \begin{aligned} && \mathbb{E} \left( (\hat{p}-p)^2 \right) &= \text{var} (\hat{p}) + \left( \mathbb{E} \left( \hat{p} \right) - p \right)^2 \\ \end{aligned} \tag{4.4} \end{equation} \] Hence the numerical exercise in this section is very similar to the one above and can be summarized as follows:
Numerical exercise
The \(b\)-th insample MSE, variance and squared bias are estimated as:
…
The resulting mean-squared error decompositions are shown in Figures 4.1 and 4.2 for weighted and ordinary least-squares. The observed patterns are broadly in line with what we observed in Section 3.
….
This is both reassuring and we have now cleared the ground to move on to our final question - does non-juvenile undersampling produce better out-of-sample predictions? We shall turn to this in the final section where will treat real data.
# Simulation parameters:
data_sets <- tryCatch(
data_sets,
error = function(e) {
readRDS("outputs/synthetic_data.rds")
}
)
n <- 1000
m <- 2^(3:7)*1e-3
weighted <- c(T,F)
pgrid <- data.table(
expand.grid(
m=m,
method=names(methods),
weighted=weighted,
data_set=names(data_sets)
)
)
methods <- list(
"UNIF" = UNIF,
"BLEV" = BLEV,
"OPT" = OPT,
"PL" = PL
)
set.seed(1)
grid_search <- rbindlist(
lapply(
1:nrow(pgrid),
function(i) {
m <- pgrid[i,m]
estimator_name <- pgrid[i,method]
estimator <- methods[[estimator_name]]
weighted <- pgrid[i, weighted]
data_set <- pgrid$data[i]
X <- data_sets[[data_set]]$X
y <- sample(c(1,0),size=n,rep=T,prob = c(m,1-m))
vars <- undersampler(X,y) # undersampler instance
output <- sim_undersampling(vars, estimator, weighted=weighted)
output[,m:=m]
output[,method:=estimator_name]
output[,weighted:=weighted]
output[,data_set:=data_set]
}
)
)
grid_search[,weighted:=factor(weighted)]
levels(grid_search$weighted) <- c("Logit","Weighted Logit")
grid_search[,log_value:=log(value)]
saveRDS(grid_search, file="outputs/grid_search_logit.rds")
grid_search <- readRDS("outputs/grid_search_logit.rds")
p_list <- lapply(
1:2,
function(i) {
ggplot(
data=grid_search[weighted==levels(weighted)[i]],
aes(x=m, y=value, colour=method)
) +
geom_line() +
geom_point() +
facet_grid(
rows = vars(variables),
cols = vars(data_set),
scales = "free_y"
) +
scale_color_discrete(
name="Subsampling method:"
) +
labs(
x="Proportion of sample",
y="Value",
title=levels(grid_search$weighted)[i]
)
}
)
names(p_list) <- levels(grid_search$weighted)
p_list$`Weighted Logit`
Figure 4.1: MSE, squared bias and variance for different subsampling methods and data sets. Subsampling with weighted logistic regression. MSE and its components correspond to linear predictions of logistic regression.
p_list$Logit
Figure 4.2: MSE, squared bias and variance for different subsampling methods and data sets. Subsampling with ordinary logistic regression. MSE and its components correspond to linear predictions of logistic regression.
Until now we have merely shown that non-uniform undersampling leads to better approximations of \(\hat\beta_n\) for logistic regression. But ultimately the goal in classification problems is to accurately predict \(y\) from test data. The results so far are a promising start: it appears that we have successfully dealt with an important issue associated with random undersampling which we already flagged above - loss of important observations. To investigate if this also leads to better out-of-sample predictions we will next turn to some real data.
dt <- fread("data/latestdata.csv")
dt <- dt[,.(ID, age, sex, latitude, longitude, outcome)]
for(col in names(dt)) {
set(dt, i=which(dt[[col]]==""), j=col, value=NA)
}
dt <- na.omit(dt)
dt[,id:=.I][,ID:=NULL]
# Preprocess outcome variable:
death_states <- unique(dt$outcome)[grepl("die",unique(dt$outcome), ignore.case = T) |
grepl("dea",unique(dt$outcome), ignore.case = T) |
grepl("decea",unique(dt$outcome), ignore.case = T)]
dt[,y:=ifelse(outcome %in% death_states,1,0)][,outcome:=NULL]
# Preprocess features:
dt[,age:=mean(as.numeric(unlist(strsplit(age,"-")))),by=id]
dt[,age:=as.numeric(age)]
dt[,sex:=ifelse(sex=="female",1,0)]
saveRDS(dt, "data/dt_preprocessed.rds")
dt <- readRDS("data/dt_preprocessed.rds")
dt <- fread("data/covid_patients.csv")
n <- dt[,.N]
state <- unique(dt$state)
Very early on as the pandemic unfolded the Korea Centers for Disease Control & Prevention (KCDC) made anonymised patient data along with a set of complementary data sets available on Kaggle. We will focus solely on the data set containing information about \(n=5165\) patients. As you can see below the data contains a set of (predominantly discrete) features describing individual subjects as well as a state column indicating whether patient has been released from the hospital, isolated in the hospital or if they have passed away.
library(DT)
datatable(
dt,
class = 'cell-border stripe',
filter = 'top',
options = list(
pageLength = 5,
dom = 't',
autoWidth = TRUE,
scrollX = TRUE
)
)
The state column can easily be transformed into a binary outcome variable \(y\in\{0=\text{survived},1=\text{deceased}\}\), which is highly imbalanced as expected:
outcome_var <- dt[,.(state)]
outcome_var[,deceased:=ifelse(state=="deceased",1,0)]
outcome_var <- outcome_var[,.(proportion=round(.N/nrow(outcome_var),3)),by=.(state,deceased)]
knitr::kable(outcome_var, col.names = c("State", "y", "Sample proportion"))
| State | y | Sample proportion |
|---|---|---|
| released | 0 | 0.567 |
| deceased | 1 | 0.015 |
| isolated | 0 | 0.418 |
In terms of selecting features we will keep things very simple and use only age and sex. Both have a significant share (around 20%) of missing values which will simply omit. As a final simplification we will treat the age variable as quasi-numeric
dt[,y:=ifelse(state=="deceased",1,0)]
dt_model <- dt[,.(y,sex,age)]
for(col in names(dt_model)) {
set(dt_model, i=which(dt_model[[col]]==""), j=col, value=NA)
}
dt_model <- na.omit(dt_model)
dt_model[,age:=as.numeric(gsub("s","",age))]
X <- dt[,.(sex, age)]
To score our probabilistic predictions we will once again use the mean-squared error, although technically in this setting we would speak of the Brier score (see here:
\[ \begin{equation} \begin{aligned} && BS&= \frac{1}{n} \sum_{i=1}^{n} (y_i-p_i)^2 \\ \end{aligned} \tag{4.5} \end{equation} \]
Of course other scoring functions are commonly considered for classification problems (for a discussion see for example here). But using the Brier score naturally continues the discussion above. It also has the benefit
data_density <- rbindlist(
lapply(
1:length(data_sets),
function(i) {
x <- c(data_sets[[i]]$X)
x_bar <- mean(c(data_sets[[i]]$X))
sigma <- sd(c(data_sets[[i]]$X))
x_std <- (x - x_bar) / sigma
data.table(value = x_std, data_set=names(data_sets)[i])
}
)
)
ggplot(data_density, aes(value)) +
geom_density() +
facet_wrap(
~ data_set,
scales = "free"
) +
labs(
x="x (standardized)",
y="Density"
)
Figure 5.1: Densities of synthetic design matrices.
When introducing the bias-variance tradeoff above we used the sinusoidal function to simulate data as in Bishop (2006). …
# Function parameters:
n <- 1000
p <- 9
a <- 0
b <- 1
x <- seq(a,b,length.out = n)
y_true <- sinusoidal(x)
y <- sim_sinusoidal(n)$y
v <- 0.3
mu <- seq(a,b,length.out = p)
Phi <- cbind(
rep(1,n),
sapply(
1:length(mu),
function(p) {
mu_p <- mu[p]
gauss_kernel(x=x, mu=mu_p, s = s)
}
)
)
# Simulation parameters:
m <- c()
counter <- 1
while(length(m)<7 & max(m/n)<.1) {
temp <- max(p+1,(2^(counter)*1e-3) * n)
m <- unique(c(m, temp))
counter <- counter + 1
}
weighted <- c(T,F)
param_grid <- data.table(expand.grid(m=m, method=names(methods), weighted=weighted))
plot(y, t="l")
set.seed(1)
grid_search <- rbindlist(
lapply(
1:nrow(param_grid),
function(i) {
m <- param_grid[i,m]
estimator_name <- as.character(param_grid[i,method])
estimator <- methods[[estimator_name]]
weighted <- param_grid[i, weighted]
output <- sim_subsampling(Phi, y, m, estimator, weighted=weighted)
output[,m:=m]
output[,method:=estimator_name]
output[,weighted:=weighted]
}
)
)
grid_search[,weighted:=factor(weighted)]
levels(grid_search$weighted) <- c("OLS","WLS")
grid_search[,log_value:=log(value)]
grid_search[,method := factor(method, levels=c("UNIF", "BLEV", "OPT", "PL"))]
saveRDS(grid_search, file="outputs/grid_search_sinus.rds")
grid_search <- readRDS("outputs/grid_search_sinus.rds")
ggplot(
data=grid_search,
aes(x=m/n, y=log_value, colour=method)
) +
geom_line() +
geom_point() +
facet_grid(
rows=vars(variables),
cols = vars(weighted)
) +
scale_color_discrete(
name="Subsampling method:"
) +
labs(
x="m/n",
y="Logarithm"
)
Iterative reweighted least-squares
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. springer.
Branco, Paula, Luis Torgo, and Rita Ribeiro. 2015. “A Survey of Predictive Modelling Under Imbalanced Distributions.” arXiv Preprint arXiv:1505.01658.
Manning, Christopher D, Hinrich Schütze, and Prabhakar Raghavan. 2008. Introduction to Information Retrieval. Cambridge university press.
Wasserman, Larry. 2013. All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media.
Zhu, Rong, Ping Ma, Michael W Mahoney, and Bin Yu. 2015. “Optimal Subsampling Approaches for Large Sample Linear Regression.” arXiv, arXiv–1509.